Session 5 Practice: Manipulating datasets

Note: There are often multiple ways to answer each question.

Install and load the MASS and dplyr packages. Load the nlschools dataset.

# Uncomment the next line to install the packages if they are not already installed
# install.packages(c("MASS", "dplyr"))
library(MASS)
library(dplyr)
data(nlschools)

How can we find a description of the nlschools dataset? Why is the class column a factor and not a numeric variable? Use some of the functions we learned to get a feel for the data.

?nlschools
str(nlschools)

## 'data.frame':    2287 obs. of  6 variables:
##  $ lang : int  46 45 33 46 20 30 30 57 36 36 ...
##  $ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
##  $ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
##  $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

summary(nlschools)

##       lang             IQ            class            GS       
##  Min.   : 9.00   Min.   : 4.00   15580  :  33   Min.   :10.00  
##  1st Qu.:35.00   1st Qu.:10.50   5480   :  31   1st Qu.:23.00  
##  Median :42.00   Median :12.00   15980  :  31   Median :27.00  
##  Mean   :40.93   Mean   :11.83   16180  :  31   Mean   :26.51  
##  3rd Qu.:48.00   3rd Qu.:13.00   18380  :  31   3rd Qu.:31.00  
##  Max.   :58.00   Max.   :18.00   5580   :  30   Max.   :39.00  
##                                  (Other):2100                  
##       SES        COMB    
##  Min.   :10.00   0:1658  
##  1st Qu.:20.00   1: 629  
##  Median :27.00           
##  Mean   :27.81           
##  3rd Qu.:35.00           
##  Max.   :50.00           
##

head(nlschools)

##   lang   IQ class GS SES COMB
## 1   46 15.0   180 29  23    0
## 2   45 14.5   180 29  10    0
## 3   33  9.5   180 29  15    0
## 4   46 11.0   180 29  23    0
## 5   20  8.0   180 29  10    0
## 6   30  9.5   180 29  10    0

class is a factor rather than a numeric variable because it holds class IDs. The IDs are labels: they have no meaningful ordering, and arithmetic on them (such as averaging) would make no sense.
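
A small sketch of one practical consequence, using made-up class IDs (the class_id vector below is illustrative and not taken from the dataset): calling as.numeric directly on a factor returns its internal level codes, not the IDs.

```r
# Hypothetical class IDs stored as a factor, like nlschools$class
class_id <- factor(c("180", "280", "1082", "180"))

# as.numeric() on a factor gives the internal level codes, not the IDs
codes <- as.numeric(class_id)
codes

# To get the IDs back as numbers, convert to character first
ids <- as.numeric(as.character(class_id))
ids
## [1]  180  280 1082  180
```

Keeping IDs as a factor (or character) avoids accidentally treating these codes as quantities.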

How many students are there in the dataset?

nrow(nlschools)

## [1] 2287

Create a new dataset which consists of students with verbal IQ >= 17.5.

nlschools %>% filter(IQ >= 17.5)

##    lang   IQ class GS SES COMB
## 1    51 18.0  2980 22  45    1
## 2    51 18.0  5480 32  50    0
## 3    51 17.5  5580 32  45    0
## 4    49 17.5  6280 26  40    0
## 5    54 17.5  6280 26  30    0
## 6    53 17.5  6280 26  50    0
## 7    50 17.5  9480 25  33    0
## 8    51 17.5 15980 33  20    0
## 9    50 17.5 16080 26  40    0
## 10   51 18.0 18480 29  23    0
## 11   51 18.0 19780 35  40    1
## 12   54 17.5 21880 27  50    0
## 13   51 17.5 22780 30  27    0

Create a new dataset which consists of students whose SES score is < 37 and whose class ID is 2980.

nlschools %>% filter(class == "2980" & SES < 37)

##   lang   IQ class GS SES COMB
## 1   44 14.0  2980 22  35    1
## 2   39  6.0  2980 22  35    1
## 3   41 12.5  2980 22  35    1

How many students had a language test score of more than 50?

nlschools %>% filter(lang > 50) %>%
    summarize(count = n())

##   count
## 1   360
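
The same count can also be obtained in base R. This is a minimal sketch, assuming MASS is installed; high_score and n_high are names chosen here for illustration.

```r
library(MASS)  # provides the nlschools dataset

# Logical vector: TRUE where the language score is above 50
high_score <- nlschools$lang > 50

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_high <- sum(high_score)
n_high
```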

How many students were there in each class? Which class had the most students?

nlschools %>% group_by(class) %>%
    summarize(count = n()) %>%
    arrange(desc(count))

## # A tibble: 133 x 2
##    class count
##    <fct> <int>
##  1 15580    33
##  2 5480     31
##  3 15980    31
##  4 16180    31
##  5 18380    31
##  6 5580     30
##  7 11580    30
##  8 19980    30
##  9 14880    29
## 10 3880     28
## # … with 123 more rows

Class 15580 had the most students (33).
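
A shorter route to the same table is sketched below, assuming MASS and dplyr are installed. dplyr's count combines the group_by, summarize and arrange steps, and names its count column n by default. A base R version using table is shown alongside it.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# count() groups by class and tallies rows; sort = TRUE puts the largest first
class_counts <- nlschools %>% count(class, sort = TRUE)
head(class_counts, n = 3)

# Base R: tabulate the class column and pick the label with the largest count
largest <- names(which.max(table(nlschools$class)))
largest
```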

Create a new column named pass which takes on the value “pass” if lang >= 40, “fail” otherwise. Save the dataset with the new column in a variable nlschools2, then show the first 10 rows of the dataset. (Hint: The ifelse function will be handy.)

nlschools2 <- nlschools %>% mutate(pass = ifelse(lang >= 40, "pass", "fail"))
head(nlschools2, n = 10)

##    lang   IQ class GS SES COMB pass
## 1    46 15.0   180 29  23    0 pass
## 2    45 14.5   180 29  10    0 pass
## 3    33  9.5   180 29  15    0 fail
## 4    46 11.0   180 29  23    0 pass
## 5    20  8.0   180 29  10    0 fail
## 6    30  9.5   180 29  10    0 fail
## 7    30  9.5   180 29  23    0 fail
## 8    57 13.0   180 29  10    0 pass
## 9    36  9.5   180 29  13    0 fail
## 10   36 11.0   180 29  15    0 fail
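
To see what ifelse does on its own, here is a tiny sketch with made-up scores (the scores vector is hypothetical and includes the boundary value 40):

```r
# Hypothetical language scores, including the boundary value 40
scores <- c(46, 33, 40, 20)

# ifelse() tests each element in turn and returns "pass" or "fail" for it
result <- ifelse(scores >= 40, "pass", "fail")
result
## [1] "pass" "fail" "pass" "fail"
```

Because ifelse works element by element, it fits naturally inside mutate, which expects one value per row.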

Your colleague hypothesizes that there is a strong relationship between IQ and socio-economic status (SES). Create a data frame which shows the mean IQ and language test scores of students for each SES value, and present the results in descending order of SES. Make plots of mean IQ and mean language score vs. SES.

nlschools3 <- nlschools %>% group_by(SES) %>%
    summarize(mean_IQ = mean(IQ),
              mean_lang = mean(lang)) %>%
    arrange(desc(SES))
head(nlschools3)

## # A tibble: 6 x 3
##     SES mean_IQ mean_lang
##   <int>   <dbl>     <dbl>
## 1    50    13.1      46.6
## 2    48    13.3      48.8
## 3    47    12.3      45.3
## 4    45    13.0      44.7
## 5    43    12.5      44.7
## 6    40    12.6      44.4

library(ggplot2)
ggplot(nlschools3) +
    geom_point(aes(x = SES, y = mean_IQ))

ggplot(nlschools3) +
    geom_point(aes(x = SES, y = mean_lang))
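
When reading plots of group means, it can help to know how many students each mean is based on, since a mean over a handful of students is noisier than one over hundreds. The sketch below assumes MASS and dplyr are installed and adds a count to the summary; ses_summary and n_students are names chosen here for illustration.

```r
library(MASS)   # provides the nlschools dataset
library(dplyr)

# Same summary as before, with the number of students per SES value added
ses_summary <- nlschools %>%
    group_by(SES) %>%
    summarize(n_students = n(),
              mean_IQ = mean(IQ),
              mean_lang = mean(lang)) %>%
    arrange(desc(SES))
head(ses_summary)
```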

Get a random sample of 10 rows from the dataset. (Hint: Look at the documentation for the sample_n function in the dplyr package.)

set.seed(100)
nlschools %>% sample_n(size = 10)

##    lang   IQ class GS SES COMB
## 1    40 10.0  6180 27  25    0
## 2    51 17.5 22780 30  27    0
## 3    26  8.0  6081 26  33    1
## 4    28 11.5 22280 26  40    0
## 5    51 11.5 17680 24  20    0
## 6    50 12.0 10180 25  33    0
## 7    49 10.0 13780 32  18    1
## 8    20 12.0  3380 14  15    1
## 9    42 11.0 17580 27  23    1
## 10   54 12.0 15580 34  33    0

set.seed puts the random number generator into a known state, so every time we run set.seed(100) followed by the next line, we get the same random sample. This is helpful when we want to reproduce results that involve randomness.
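
The effect of set.seed can be checked directly with base R's sample function. In this minimal sketch, 2287 is the number of rows in the dataset, and first_draw and second_draw are names chosen here for illustration.

```r
# Draw 10 row indices after setting the seed
set.seed(100)
first_draw <- sample(1:2287, size = 10)

# Reset to the same seed and draw again
set.seed(100)
second_draw <- sample(1:2287, size = 10)

# The two draws match exactly
identical(first_draw, second_draw)
## [1] TRUE
```

The particular rows drawn for a given seed can differ between R versions, so the sample shown above may not match on every machine. In dplyr 1.0.0 and later, sample_n is superseded by slice_sample, so nlschools %>% slice_sample(n = 10) is the newer way to write the same call.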

Kenneth Tay

Oct 8, 2019